Digital Leafleting: Extracting Structured Data from Multimedia Online Flyers

Authors

  • Emilia Apostolova
  • Payam Pourashraf
  • Jeffrey Sack
Abstract

Marketing materials such as flyers and other infographics are a vast online resource. In a number of industries, such as the commercial real estate industry, they are in fact the only authoritative source of information. Companies attempting to organize commercial real estate inventories spend a significant amount of resources on manual data entry of this information. In this work, we propose a method for extracting structured data from free-form commercial real estate flyers in PDF and HTML formats. We modeled the problem as text categorization and Named Entity Recognition (NER) tasks and applied a supervised machine learning approach (Support Vector Machines). Our dataset consists of more than 2,200 commercial real estate flyers and associated manually entered structured data, which was used to automatically create training datasets. Traditionally, text categorization and NER approaches are based on textual information only. However, information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. Large fonts, visually salient colors, and positioning often indicate the most relevant pieces of information. We applied novel features based on visual characteristics in addition to traditional text features and show that performance improved significantly for both the text categorization and NER tasks.
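The idea of augmenting text features with visual ones can be illustrated with a minimal sketch. It is not the authors' implementation: the `TextBlock` structure, the feature names, the saturation-based saliency proxy, and the sample flyer text are all hypothetical, chosen only to show how font size, color, and position might sit alongside bag-of-words features in one feature map per text block.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A hypothetical text fragment extracted from a flyer (PDF or HTML)."""
    text: str
    font_size: float   # in points
    color: tuple       # (r, g, b), each channel 0-255
    y_position: float  # 0.0 = top of the page, 1.0 = bottom


def color_saturation(rgb):
    """Crude saliency proxy (an assumption): spread between the strongest
    and weakest color channel, scaled to the range 0..1."""
    return (max(rgb) - min(rgb)) / 255.0


def block_features(block, max_font_size):
    """Merge bag-of-words counts with visual features for one block."""
    features = {}
    # Traditional textual features: lowercase token counts.
    for token in block.text.lower().split():
        key = "word=" + token
        features[key] = features.get(key, 0.0) + 1.0
    # Visual features: font size relative to the largest font on the flyer,
    # color saliency, and closeness to the top of the page.
    features["vis:rel_font_size"] = block.font_size / max_font_size
    features["vis:saturation"] = color_saturation(block.color)
    features["vis:top_of_page"] = 1.0 - block.y_position
    return features


def flyer_features(blocks):
    """Build one feature dict per text block of a single flyer."""
    max_font = max(b.font_size for b in blocks)
    return [block_features(b, max_font) for b in blocks]


# Made-up example flyer with a large red headline and small black footer.
blocks = [
    TextBlock("Office Space For Lease", 36.0, (200, 0, 0), 0.05),
    TextBlock("Contact the listing broker for details", 9.0, (0, 0, 0), 0.95),
]
features = flyer_features(blocks)
```

Feature dicts of this kind could then be vectorized and passed to a supervised classifier such as a Support Vector Machine; that step is omitted here because the abstract does not describe the exact feature encoding or training setup.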


Similar resources

Multimedia Datamining in Medical Applications

Over the past decades, data mining has proved to be a successful approach for extracting hidden knowledge from huge collections of structured digital data stored in databases. From the inception, data mining was done primarily on numerical set of data. Nowadays, large multimedia data sets such as audio, speech, text, web, image, video and combination of several types are becoming increasingly a...


Content Extraction from Marketing Flyers

The rise of online shopping has hurt physical retailers, which struggle to persuade customers to buy products in physical stores rather than online. Marketing flyers are a great means to increase the visibility of physical retailers, but the unstructured offers appearing in those documents cannot be easily compared with similar online deals, making it hard for a customer to understand whether it...


8 Video Data Mining

Data mining, which is defined as the process of extracting previously unknown knowledge and detecting interesting patterns from a massive set of data, has been an active research area. As a result, several commercial products and research prototypes are available nowadays. However, most of these studies have focused on corporate data — typically in an alpha-numeric database, and relatively less...


Digital Watermarking Technology in Different Domains

Due to high speed computer networks, the use of digitally formatted data has increased many folds. The digital data can be duplicated and edited with great ease which has led to a need for effective copyright protection tools. Digital Watermarking is a technology of embedding watermark with intellectual property rights into images, videos, audios and other multimedia data by a certain algorithm. Di...


Steganography Scheme Based on Reed-Muller Code with Improving Payload and Ability to Retrieval of Destroyed Data for Digital Images

In this paper, a new steganography scheme with high embedding payload and good visual quality is presented. Before embedding process, secret information is encoded as block using Reed-Muller error correction code. After data encoding and embedding into the low-order bits of host image, modulus function is used to increase visual quality of stego image. Since the proposed method is able to embed...



Publication date: 2015